Chapter 10 Bootstrap
10.0.1 Bootstrap resampling
Having fitted the model once, how can we gauge the confidence we should have in that estimate? Bootstrap resampling takes the original dataset and draws random data points from within it; to keep each resample the same size as the original dataset, some records are duplicated. This is known as bootstrap resampling with replacement.
So here we will first remove all the columns we aren't using in this model with the dplyr package (part of the tidyverse), as they will just slow us down. At the moment we only care about average GCSE scores and unauthorised absence in schools.
library(rsample)
library(dplyr)
LonWardProfilesGCSE <- LonWardProfiles %>%
dplyr::select(`Average GCSE capped point scores - 2014`,
`Unauthorised Absence in All Schools (%) - 2013`)
Next we are going to create our samples using the bootstraps() function. First we set the seed (to any number), which means our results will be the same if we re-run this analysis; if we don't, then because we are randomly resampling the data, the results could change the next time we run the code.
Here we have created 1,000 resampled versions of our original data (times=1000). apparent = TRUE means the full original dataset is also kept, as an extra "apparent" sample, in our GCSE_boot variable.
set.seed(99)
GCSE_boot <- st_drop_geometry(LonWardProfilesGCSE) %>%
bootstraps(times = 1000, apparent = TRUE)
GCSE_boot
## # Bootstrap sampling with apparent sample
## # A tibble: 1,001 x 2
## splits id
## <list> <chr>
## 1 <split [626/234]> Bootstrap0001
## 2 <split [626/230]> Bootstrap0002
## 3 <split [626/220]> Bootstrap0003
## 4 <split [626/243]> Bootstrap0004
## 5 <split [626/222]> Bootstrap0005
## 6 <split [626/224]> Bootstrap0006
## 7 <split [626/226]> Bootstrap0007
## 8 <split [626/223]> Bootstrap0008
## 9 <split [626/226]> Bootstrap0009
## 10 <split [626/232]> Bootstrap0010
## # ... with 991 more rows
So when we look in GCSE_boot we have 1,001 rows: our 1,000 bootstrap resamples plus the apparent (original) sample.
Here we are now going to run the linear regression model for each of our bootstrap resampled instances. To do this we use the map() function from the purrr package, which is part of the tidyverse. This lets us apply a function (our regression formula) to each element of a list (the splits we calculated from bootstrap resampling).
map() takes what we are mapping over, then the function to apply to it. So we map over our splits (from the bootstrap resamples) and apply our regression formula to each one. To save the output we create a new column called model with the mutate() function from dplyr, also part of the tidyverse. mutate() simply adds a column to an existing table.
dplyr is possibly one of the most useful packages I've ever come across. It's described as the grammar of data manipulation, with a set of functions (here also called verbs) that let you transform your data into almost any format. Let's first finish our example with bootstrap resampling, then we can have some fun with dplyr.
GCSE_models <- GCSE_boot %>%
mutate(
model = map(splits, ~ lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`,
data = .)))
# let's look at the first model results
GCSE_models$model[[1]]
##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`,
## data = .)
##
## Coefficients:
## (Intercept)
## 372.53
## `Unauthorised Absence in All Schools (%) - 2013`
## -41.48
However, this is a very ugly output and pretty hard to understand. Remember when we went through the magic of broom? You might have thought that it didn't do very much! This is where it can really be of use. We want to tidy up our model results so we can actually use them. First let's add a new column (coef_info) with mutate(), then take our model column and, using the map() function again, tidy the outputs (saving the result into that new coef_info column).
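A sketch of this step might look like the following (assuming broom and purrr are loaded; tidy() is applied to each fitted model in the model column):

```r
library(purrr)
library(broom)

# add a coef_info list-column holding the tidied coefficients of each model
GCSE_models_tidy <- GCSE_models %>%
  mutate(coef_info = map(model, tidy))
```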
This has just added the information about our coefficients to a new column in the original table. But what if we actually want to extract this information? We can use unnest() from the tidyr package (also part of the tidyverse), which takes a list-column and makes each element its own row. I know it's a list-column because if you explore GCSE_models_tidy (just run that in the console) you will see that under the coef_info column title it says list.
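A minimal sketch of the unnest step (assuming the tidied models are held in GCSE_models_tidy, with the result saved as GCSE_coef):

```r
library(tidyr)

# expand the coef_info list-column so each coefficient gets its own row
GCSE_coef <- GCSE_models_tidy %>%
  unnest(coef_info)
```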
## # A tibble: 2,002 x 8
## splits id model term estimate std.error statistic p.value
## <list> <chr> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <split [~ Bootst~ <lm> (Intercept) 373. 1.92 194. 0.
## 2 <split [~ Bootst~ <lm> `Unauthorised ~ -41.5 1.70 -24.4 4.72e-93
## 3 <split [~ Bootst~ <lm> (Intercept) 374. 2.17 173. 0.
## 4 <split [~ Bootst~ <lm> `Unauthorised ~ -42.5 1.96 -21.6 7.17e-78
## 5 <split [~ Bootst~ <lm> (Intercept) 372. 2.16 172. 0.
## 6 <split [~ Bootst~ <lm> `Unauthorised ~ -42.3 1.91 -22.1 1.39e-80
## 7 <split [~ Bootst~ <lm> (Intercept) 372. 2.08 179. 0.
## 8 <split [~ Bootst~ <lm> `Unauthorised ~ -42.6 1.86 -22.9 9.17e-85
## 9 <split [~ Bootst~ <lm> (Intercept) 370. 2.27 163. 0.
## 10 <split [~ Bootst~ <lm> `Unauthorised ~ -39.7 2.00 -19.8 2.45e-68
## # ... with 1,992 more rows
If you look in the tibble GCSE_coef you will notice under the column heading term that we have intercept and unauthorised absence values. Here we just want to see the variance of the coefficients for unauthorised absence, so let's filter it; again this uses dplyr.
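One way to do this filter (a sketch; the value matched in term is taken from the output below, and the backticks appear in the term because the model formula used backticked column names):

```r
library(dplyr)

# keep only the slope coefficient rows for unauthorised absence
coef <- GCSE_coef %>%
  filter(term == "`Unauthorised Absence in All Schools (%) - 2013`")
```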
## # A tibble: 1,001 x 8
## splits id model term estimate std.error statistic p.value
## <list> <chr> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <split [~ Bootst~ <lm> `Unauthorised ~ -41.5 1.70 -24.4 4.72e-93
## 2 <split [~ Bootst~ <lm> `Unauthorised ~ -42.5 1.96 -21.6 7.17e-78
## 3 <split [~ Bootst~ <lm> `Unauthorised ~ -42.3 1.91 -22.1 1.39e-80
## 4 <split [~ Bootst~ <lm> `Unauthorised ~ -42.6 1.86 -22.9 9.17e-85
## 5 <split [~ Bootst~ <lm> `Unauthorised ~ -39.7 2.00 -19.8 2.45e-68
## 6 <split [~ Bootst~ <lm> `Unauthorised ~ -40.2 1.89 -21.3 3.20e-76
## 7 <split [~ Bootst~ <lm> `Unauthorised ~ -45.1 1.89 -23.9 5.59e-90
## 8 <split [~ Bootst~ <lm> `Unauthorised ~ -41.5 1.84 -22.6 4.60e-83
## 9 <split [~ Bootst~ <lm> `Unauthorised ~ -40.1 1.94 -20.7 4.59e-73
## 10 <split [~ Bootst~ <lm> `Unauthorised ~ -41.1 2.07 -19.8 4.02e-68
## # ... with 991 more rows
Now let's have a look at a histogram of our coefficients to see their distribution…
coef %>%
ggplot(aes(x=estimate)) +
geom_histogram(position="identity", alpha=0.5, bins=15, fill="lightblue2", col="lightblue3")+
geom_vline(aes(xintercept=mean(estimate)),
color="blue",
linetype="dashed")+
labs(title="Bootstrap resample estimates",
x="Coefficient estimates",
y="Frequency")+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
Now, wouldn't it be useful to get some confidence intervals based on our bootstrap resamples? Well, we can, using int_pctl() from the rsample package, which forms part of tidymodels.
Here, we give the function our models (GCSE_models_tidy), tell it which column the statistics are in (coef_info) and specify the level of significance (alpha). If you recall, we set the apparent argument to TRUE earlier on; this is because it is a requirement of int_pctl().
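A sketch of the call (int_pctl() comes from rsample; alpha = 0.05 is its default, shown here explicitly):

```r
library(rsample)

# percentile bootstrap confidence intervals from the tidied models
int_pctl(GCSE_models_tidy, coef_info, alpha = 0.05)
```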
## # A tibble: 2 x 6
## term .lower .estimate .upper .alpha .method
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 367. 371. 376. 0.05 percenti~
## 2 `Unauthorised Absence in All Schools~ -45.2 -41.2 -37.3 0.05 percenti~
A change of 1% in unauthorised absence reduces the average GCSE point score by between 37.3 and 45.2 points (this is our confidence interval) at a 95% confidence level.
We can also change alpha to 0.01, which would give the 99% confidence level, but you'd expect the confidence interval range to be greater.
Here our confidence level means that if we ran the same regression model again with different sample data, we could be 95% sure that the estimates would fall within this range.
Confidence interval = the range of values that a future estimate is expected to fall between
Confidence level = the percentage certainty that a future value will fall within our confidence interval
Now let's add our predictions to our original data for a visual comparison. We will do this using the augment() function from the broom package; again we will need to use unnest() too.
GCSE_aug <- GCSE_models_tidy %>%
#sample_n(5) %>%
mutate(augmented = map(model, augment))%>%
unnest(augmented)
There are so many rows in this tibble because the statistics are produced for each point within our original dataset and then added as an additional row. However, note that the statistics are the same within each bootstrap; we will have a quick look into this now.
So we know from our London Ward Profiles layer that we have 626 data points, check this with:
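One way to check (a sketch; the geometry is dropped first, as earlier, so we count rows of the plain data frame):

```r
library(sf)

# count the rows (wards) in the original data
nrow(st_drop_geometry(LonWardProfiles))
```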
## [1] 626
Now, let’s have a look at our first bootstrap using filter() from dplyr
firstboot<-filter(GCSE_aug,id=="Bootstrap0001")
length(firstboot$Average.GCSE.capped.point.scores...2014)
## [1] 626
Ok, so how do we know that our coefficient is the same for all 626 points within our first bootstrap? Well, we could check all of them, but this would just print the coefficient info for all 626…
Another way is to just ask what the unique values within our coefficient info are.
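For example (a sketch; firstboot was created above and coef_info is a list-column, so unique() returns its distinct elements):

```r
# distinct coefficient tables within the first bootstrap
unique(firstboot$coef_info)
```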
## [[1]]
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 373. 1.92 194. 0.
## 2 `Unauthorised Absence in All Schools (%~ -41.5 1.70 -24.4 4.72e-93
Right, now it would be really useful if we could plot our original data, but instead of fitting a single line of best fit we use all the lines from the bootstrap resamples to show the possible variance.
However, it really annoys me that sometimes column names get crazy, for example GCSE_aug$Average.GCSE.capped.point.scores...2014: just look at all those dots!
Wouldn't it be good if there was a way to standardise all the names? Well, we can, with help from the janitor package.
So let's clean up our names, then use them in our plot.
library(janitor)
GCSE_aug <- janitor::clean_names(GCSE_aug)
ggplot(GCSE_aug, aes(unauthorised_absence_in_all_schools_2013,
average_gcse_capped_point_scores_2014))+
# we want our lines to be from the fitted column grouped by our bootstrap id
geom_line(aes(y = fitted, group = id), alpha = .2, col = "cyan3") +
# remember our apparent data is the original within the bootstrap
geom_point(data=filter(GCSE_aug,id=="Apparent"))+
#add some labels to x and y
labs(x="unauthorised absence in all schools 2013 (%)",
y="Average GCSE capped point scores 2014")
You'll notice that this plot is slightly different to the one we first produced in Regression Basics, as we've used geom_point() not geom_jitter(); the latter adds a small amount of random variation to the positions of the points (not the data). If you change it to geom_jitter() you will get the same plot as before (apart from the bootstrap fitted lines, obviously). We will use qplot() again; it comes from ggplot2 and lets us make quick plots with code similar to ggplot(), but avoid it when you have large amounts of data and use ggplot() instead.
10.0.2 Assumptions Underpinning Linear Regression
10.0.3 Assumption 1 - There is a linear relationship between the dependent and Independent variables
The best way to test for this assumption is to plot a scatter plot similar to the one created earlier. It may not always be practical to create a series of scatter plots, so one quick way to check that a linear relationship is probable is to look at the frequency distributions of the variables. If they are normally distributed, then there is a good chance that if the two variables are in some way correlated, this will be a linear relationship.
For example, look at the frequency distributions of our two variables earlier:
#let's check the distribution of these variables first
ggplot(LonWardProfiles, aes(x=`Average GCSE capped point scores - 2014`)) +
geom_histogram(aes(y = ..density..),binwidth = 5) +
geom_density(colour="red", size=1, adjust=1)
ggplot(LonWardProfiles, aes(x=`Unauthorised Absence in All Schools (%) - 2013`)) +
geom_histogram(aes(y = ..density..),binwidth = 0.1) +
geom_density(colour="red", size=1, adjust=1)
We would describe both of these distributions as being relatively 'normally' or Gaussian distributed - https://en.wikipedia.org/wiki/Normal_distribution - and thus more likely to have a linear correlation (if they are indeed associated).
Contrast this with the median house price variable:

We would describe this as a not-normal and/or positively 'skewed' - https://en.wikipedia.org/wiki/Skewness - distribution; there are more observations towards the lower end of the average house prices observed in the city, but there is a long tail to the distribution, i.e. a small number of wards where the average house price is very large indeed.
If we plot the raw house price variable against GCSE scores, we get the following scatter plot:
qplot(x = `Median House Price (£) - 2014`,
y = `Average GCSE capped point scores - 2014`,
data=LonWardProfiles)
This indicates that we do not have a linear relationship, indeed it suggests that this might be a curvilinear relationship.
10.0.3.1 Transforming variables
One way that we might be able to achieve a linear relationship between our two variables is to transform the non-normally distributed variable so that it is more normally distributed.
There is some debate as to whether this is a wise thing to do, as, amongst other things, the coefficients for transformed variables are much harder to interpret; however, we will have a go here to see if it makes a difference.
Tukey’s ladder of transformations
You might be asking how we could go about transforming our variables. In 1977, Tukey described a series of power transformations that could be applied to a variable to alter its frequency distribution.
In regression analysis, analysts will frequently take the log of a variable to change its distribution, but this is a little crude and may not result in a completely normal distribution. For example, we can take the log of the house price variable:

This looks a little more like a normal distribution, but it is still a little skewed.
Fortunately in R, we can use the symbox() function in the car package to try a range of transformations along Tukey's ladder:
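A sketch of the call (symbox() takes a formula and produces boxplots of the variable raised to each power along the ladder; the powers shown here are illustrative):

```r
library(car)

# try a range of Tukey power transformations on median house price
symbox(~ `Median House Price (£) - 2014`,
       data = LonWardProfiles,
       powers = c(-2, -1, -0.5, 0, 0.5, 1, 2))
```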

Observing the plot above, it appears that raising our house price variable to the power of -1 should lead to a more normal distribution:

qplot(x = (`Median House Price (£) - 2014`)^-1,
y = `Average GCSE capped point scores - 2014`,
data=LonWardProfiles)
Compare this with the logged transformation:
qplot(x = log(`Median House Price (£) - 2014`),
y = `Average GCSE capped point scores - 2014`,
data=LonWardProfiles)
10.0.3.2 Should I transform my variables?
The decision is down to you as the modeller - it might be that a transformation doesn't succeed in normalising the distribution of your data, or that the interpretation after the transformation is problematic. However, it is important not to violate the assumptions underpinning the regression model, or your conclusions may be on shaky ground.
10.0.4 Assumption 2 - The residuals in your model should be normally distributed
This assumption is easy to check. When we ran model1 earlier, one of the outputs stored in the model1 object was the residual value for each case (ward) in the dataset. We can plot these as a histogram and see if there is a normal distribution:
#save the residuals into your dataframe
LonWardProfiles$model1_resids <- model1$residuals
qplot(model1$residuals,
      geom = "histogram")
Examining the histogram above, we can be happy that our residuals look to be relatively normally distributed.
10.0.5 Assumption 3 - No Multicollinearity in the independent variables
Now, the regression model we have been experimenting with so far is a simple bivariate (two-variable) model. One of the nice things about regression modelling is that while we can only easily visualise linear relationships in a two (or at most three) dimensional scatter plot, mathematically we can have as many dimensions/variables as we like.
As such, we could extend model 1 into a multiple regression model by adding some more explanatory variables that we think could affect GCSE scores. Let's try the log or ^-1 transformed house price variable from earlier (the rationale being that higher house prices indicate more affluence and therefore, potentially, more engagement with education):
model2 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
log(`Median House Price (£) - 2014`), data = LonWardProfiles)
#show the summary of those outputs
summary(model2)
##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`), data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.257 -10.219 -0.468 8.828 50.821
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 201.517 20.093 10.029
## `Unauthorised Absence in All Schools (%) - 2013` -36.206 1.919 -18.869
## log(`Median House Price (£) - 2014`) 12.786 1.504 8.503
## Pr(>|t|)
## (Intercept) <2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013` <2e-16 ***
## log(`Median House Price (£) - 2014`) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.53 on 623 degrees of freedom
## Multiple R-squared: 0.4833, Adjusted R-squared: 0.4816
## F-statistic: 291.3 on 2 and 623 DF, p-value: < 2.2e-16
#and for future use, write the residuals out to a column in your dataframe
LonWardProfiles$model2_resids <- residuals(model2)
Examining the output above, it is clear that including median house price in our model improves the fit from an \(r^2\) of around 42% to an \(r^2\) of 48%. Median house price is also a statistically significant variable.
But do our two explanatory variables satisfy the no-multicollinearity assumption? If not, and the variables are highly correlated, then we are effectively double-counting the influence of these variables and overstating their explanatory power.
To check this, we can compute the Pearson correlation coefficient between the variables. In an ideal world, we would be looking for a correlation of less than 0.8.
library(corrplot)
#first drop the geometry column from the dataframe as it will cause problems
tempdf <- st_set_geometry(LonWardProfiles,NULL)
#pull out the columns we want to check for multicolinearity
tempdf <- tempdf[,c("Median House Price (£) - 2014", "Unauthorised Absence in All Schools (%) - 2013")]
#log houseprice
tempdf[1] <- log(tempdf[1])
#rename the columns to something shorter
names(tempdf) <- c("Med House Price", "Un Auth Absence")
#compute the correlation matrix for the two variables of interest
cormat <- cor(tempdf[,1:2], use="complete.obs", method="pearson")
#visualise the correlation matrix
corrplot(cormat)
Looking at either the correlation matrix or the correlation plot of that matrix, it’s easy to see that there is a low correlation (around 30%) between our two independent variables.
Multiple linear regression can be explained nicely with this example from allison_horst.
Let’s meet our Multiple Linear Regression teaching assistants:

Here my research question might be:
What are the factors that might lead to variation in dragon weight?
We might have a categorical predictor variable, such as whether the dragon is spotted: if it is, it might weigh 0.6 tonnes more on average. So if the dragon is not spotted, then in the equation below spotted takes the value 0 and the 0.6 x (0) term contributes nothing, giving a weight of 2.4 tonnes; if it is spotted, spotted takes the value 1 and 0.6 tonnes is added.

But we could also have a continuous predictor variable… here a dragon one foot taller would weigh an additional 0.3 tonnes on average, as the equation would be 2.4 + 0.3(1)…

We can use these equations to make predictions about a new dragon… here the dragon is striped, so spotted equals 0, and its height is 5.1 feet. That gives us 2.4 + (0.3 x 5.1), which equals roughly 3.9 tonnes.

However, we need to check the residuals (the difference between the predicted weight and the actual weight); here it's 0.3 tonnes, as the actual weight of the dragon was 4.2 tonnes.

Then for all the dragons we need to make sure these residuals are normally distributed, but you should also check all the other assumptions (1-5) shown within this practical and give brief evidence in reports that they are all valid.

10.0.5.1 Variance Inflation Factor (VIF)
Another way to check for multicollinearity is to examine the VIF for the model. If any variable has a VIF value exceeding 10, then we may need to worry and perhaps remove that variable from the analysis:
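A sketch using vif() from the car package on the multiple regression model fitted earlier:

```r
library(car)

# variance inflation factors for each explanatory variable in model2
vif(model2)
```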
## `Unauthorised Absence in All Schools (%) - 2013`
## 1.105044
## log(`Median House Price (£) - 2014`)
## 1.105044
Both the correlation plots and examination of VIF indicate that our multiple regression model meets the assumptions around multicollinearity and so we can proceed further.
If we wanted to add more variables into our model, it would be useful to check for multicollinearity amongst every variable we want to include. We can do this by computing a correlation matrix for the whole dataset or by checking the VIF after running the model:
tempdf <- st_set_geometry(LonWardProfiles,NULL)
cormat <- cor(tempdf[,10:74], use="complete.obs", method="pearson")
str(tempdf)
## 'data.frame': 626 obs. of 75 variables:
## $ NAME : Factor w/ 604 levels "Abbey","Abbey Road",..: 102 531 33 9 36 130 101 508 394 490 ...
## $ GSS_CODE : chr "E05000405" "E05000414" "E05000401" "E05000400" ...
## $ HECTARES : num 755 259 145 269 188 ...
## $ NONLD_AREA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LB_GSS_CD : Factor w/ 33 levels "E09000001","E09000002",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ BOROUGH : Factor w/ 33 levels "Barking and Dagenham",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ POLY_ID : Factor w/ 625 levels "116694","116695",..: 461 181 321 327 182 180 354 328 326 322 ...
## $ Ward name : chr "Kingston upon Thames - Chessington South" "Kingston upon Thames - Tolworth and Hook Rise" "Kingston upon Thames - Berrylands" "Kingston upon Thames - Alexandra" ...
## $ Old code : chr "00AXGC" "00AXGM" "00AXFY" "00AXFX" ...
## $ Population - 2015 : num 10550 10650 9800 9700 10450 ...
## $ Children aged 0-15 - 2015 : num 2200 2150 1800 1900 2200 1950 1800 1900 1850 1350 ...
## $ Working-age (16-64) - 2015 : num 6800 7050 6500 6200 6900 7150 5800 7550 6100 8650 ...
## $ Older people aged 65+ - 2015 : num 1500 1450 1500 1600 1350 1600 1400 1450 1750 900 ...
## $ % All Children aged 0-15 - 2015 : num 20.9 20.1 18.2 19.7 20.8 18.3 20.1 17.4 19 12.2 ...
## $ % All Working-age (16-64) - 2015 : num 64.6 66.1 66.4 63.7 66 66.8 64.6 69.2 63 79.4 ...
## $ % All Older people aged 65+ - 2015 : num 14.4 13.8 15.4 16.6 13.1 14.9 15.3 13.5 18.1 8.3 ...
## $ Mean Age - 2013 : num 37.6 37.3 38.3 38.9 37.3 37.1 38.3 37.7 39.9 33.6 ...
## $ Median Age - 2013 : num 37 36 36 39 37 34 38 35 40 30 ...
## $ Area - Square Kilometres : num 7.6 2.6 1.5 2.7 1.9 4.4 1.9 1.7 1.8 1.4 ...
## $ Population density (persons per sq km) - 2013 : num 1375 3962 6467 3537 5447 ...
## $ % BAME - 2011 : num 13 27.2 18.1 29.8 32.6 35.5 15 19 30.8 21.3 ...
## $ % Not Born in UK - 2011 : num 14.3 26.1 24.1 27.2 32.4 40.6 17.2 25 27.3 29.1 ...
## $ % English is First Language of no one in household - 2011 : num 3.9 8 6.9 8.2 13.3 13.9 5 7 8.4 8.9 ...
## $ General Fertility Rate - 2013 : num 64.1 59.1 56.8 61.6 75 44.7 55 60 58.4 46.9 ...
## $ Male life expectancy -2009-13 : num 83.3 80.3 79 83.4 80.8 83.2 80.4 79.7 83.6 81.7 ...
## $ Female life expectancy -2009-13 : num 83.9 86.7 83.3 86.9 81.8 84.8 86.8 82.6 87.4 84.3 ...
## $ % children in reception year who are obese - 2011/12 to 2013/14 : num 8.1 4.9 4.6 7.6 6.5 4.9 9.8 4.2 6.7 2.9 ...
## $ % children in year 6 who are obese- 2011/12 to 2013/14 : num 24.6 15.2 8.6 20.6 13.1 21.4 23.8 9.8 16.4 8.7 ...
## $ Rate of All Ambulance Incidents per 1,000 population - 2014 : num 109.7 115.2 102.8 92.4 117.9 ...
## $ Rates of ambulance call outs for alcohol related illness - 2014 : num 0.5 0.6 0.4 0.4 0.6 0.7 0.7 0.4 0.3 0.7 ...
## $ Number Killed or Seriously Injured on the roads - 2014 : num 0 3 1 3 0 4 2 2 1 3 ...
## $ In employment (16-64) - 2011 : num 5291 4993 4813 4464 4807 ...
## $ Employment rate (16-64) - 2011 : num 78.5 76.2 74.8 75.1 71.6 64.5 76.4 78.2 74 69.4 ...
## $ Number of jobs in area - 2013 : num 5300 5200 2000 2500 4500 7200 2000 2100 1100 4300 ...
## $ Employment per head of resident WA population - 2013 : num 0.8 0.8 0.3 0.4 0.7 1 0.3 0.3 0.2 0.5 ...
## $ Rate of new registrations of migrant workers - 2011/12 : num 6.8 22.7 15.9 17.6 30 34.7 10.4 19 15.6 25.3 ...
## $ Median House Price (£) - 2014 : num 315000 337195 361125 404975 435000 ...
## $ Number of properties sold - 2014 : num 201 153 172 136 167 161 137 237 134 286 ...
## $ Median Household income estimate (2012/13) : num 38310 37840 42330 41390 40700 ...
## $ Number of Household spaces - 2011 : num 4112 3836 4447 3443 4047 ...
## $ % detached houses - 2011 : num 11.2 6.4 10.5 9.3 9.6 29.5 6.6 16.2 18.7 4.2 ...
## $ % semi-detached houses - 2011 : num 26.4 46.8 25.7 56.3 33.7 16.8 46.6 18.8 45.9 16.5 ...
## $ % terraced houses - 2011 : num 41.3 24.5 9 21.5 26.3 11.6 29.5 7.3 16.9 9.1 ...
## $ % Flat, maisonette or apartment - 2011 : num 20.9 22 54.7 12 30.4 42.1 17.2 57.7 18.5 70 ...
## $ % Households Owned - 2011 : num 77.3 71.2 54.6 80.2 63.7 61 74.8 60.5 77.4 53.1 ...
## $ % Households Social Rented - 2011 : num 11.7 10.7 17.4 4.9 11.5 11.6 14.8 7.5 9.1 6.7 ...
## $ % Households Private Rented - 2011 : num 9.7 16.5 26.9 13.3 22.1 25.5 8.6 30.7 12.3 38.6 ...
## $ % dwellings in council tax bands A or B - 2015 : num 4.1 4.6 8.9 4 5.5 1.1 1.8 4.3 3.1 8.4 ...
## $ % dwellings in council tax bands C, D or E - 2015 : num 89.2 91.8 70.4 79.4 76.4 51.7 96.2 74.7 63.6 80.2 ...
## $ % dwellings in council tax bands F, G or H - 2015 : num 6.7 3.6 20.7 16.6 18.1 47.2 2 21 33.3 11.4 ...
## $ Claimant rate of key out-of-work benefits (working age client group) (2014): num 7.3 6.7 7.3 5.5 7.2 4.9 8.4 5.2 6.7 4.8 ...
## $ Claimant Rate of Housing Benefit (2015) : num 6.4 6.8 9.8 5.1 8.8 5.4 6.3 5.4 5.4 5.5 ...
## $ Claimant Rate of Employment Support Allowance - 2014 : num 3.2 3 4.2 3.1 3.6 2 3.7 2.6 2.6 2.8 ...
## $ Rate of JobSeekers Allowance (JSA) Claimants - 2015 : num 1.3 1.3 1.4 1 1.8 1.3 1.4 1.1 1.1 1.5 ...
## $ % dependent children (0-18) in out-of-work households - 2014 : num 13.1 10.2 6.8 6.3 9.2 8.2 10.9 9.3 8.9 4.3 ...
## $ % of households with no adults in employment with dependent children - 2011: num 3.9 3.3 2.3 2.7 4.1 3.9 3.6 2.3 3.7 1.6 ...
## $ % of lone parents not in employment - 2011 : num 39.6 38.2 36.9 31.2 43.4 50.2 35.9 39.3 39.2 40.8 ...
## $ (ID2010) - Rank of average score (within London) - 2010 : num 548 528 530 590 544 570 516 568 566 537 ...
## $ (ID2010) % of LSOAs in worst 50% nationally - 2010 : num 16.7 0 16.7 0 16.7 14.3 0 0 16.7 16.7 ...
## $ Average GCSE capped point scores - 2014 : num 321 338 343 353 372 ...
## $ Unauthorised Absence in All Schools (%) - 2013 : num 0.8 0.7 0.5 0.4 0.7 0.9 0.8 0.6 0.7 0.5 ...
## $ % with no qualifications - 2011 : num 19.1 18.4 11.9 15.8 15.2 10.7 21.9 10.7 15.7 6.7 ...
## $ % with Level 4 qualifications and above - 2011 : num 25.3 30 48.4 32.7 41.7 44.5 22.8 52.2 34.4 48.3 ...
## $ A-Level Average Point Score Per Student - 2013/14 : num 643 714 734 762 762 ...
## $ A-Level Average Point Score Per Entry; 2013/14 : num 204 213 213 215 219 ...
## $ Crime rate - 2014/15 : num 47.5 53.7 31.6 34.7 64.5 47.8 43.9 29.6 29.6 49.7 ...
## $ Violence against the person rate - 2014/15 : num 16.3 14.2 8.2 7.9 15.4 12.6 13.9 10.1 8 13.5 ...
## $ Deliberate Fires per 1,000 population - 2014 : num 0.6 0.7 0.3 0.3 0.7 0.3 1.2 0.3 0.1 0.4 ...
## $ % area that is open space - 2014 : num 75.7 33.1 10.9 43.7 26.4 42.1 27.5 5.8 21.3 23.9 ...
## $ Cars per household - 2011 : num 1.4 1.2 1 1.4 1 1.3 1.3 1 1.4 0.9 ...
## $ Average Public Transport Accessibility score - 2014 : num 2.4 2.3 2.8 2.2 2.8 2.5 2.4 3.3 2.7 3.7 ...
## $ % travel by bicycle to work - 2011 : num 2.3 2.9 4.6 3.9 4.4 3.2 2.5 3.7 2.3 4.2 ...
## $ Turnout at Mayoral election - 2012 : num 30.4 32.1 38 36.3 41.2 34.9 30.8 34.7 36.1 30.2 ...
## $ model1_resids : num -17.18 -5.11 -8.15 -1.68 29.69 ...
## $ model2_resids : num -13.12 -1.41 -4.33 1.18 30.13 ...

10.0.6 Assumption 4 - Homoscedasticity
Homoscedasticity means that the errors/residuals in the model exhibit constant/homogeneous variance; if they don't, then we say there is heteroscedasticity present. Why does this matter? Andy Field does a much better job of explaining this than I ever could - https://www.discoveringstatistics.com/tag/homoscedasticity/ - but essentially, if your errors do not have constant variance, then your parameter estimates could be wrong, as could the estimates of their significance.
The best way to check for homo/heteroscedasticity is to plot the residuals in the model against the predicted values. We are looking for a cloud of points with no apparent patterning.
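One quick way to get these diagnostics (a sketch; calling base R's plot() on an lm object produces the residuals-vs-fitted, normal Q-Q, scale-location and residuals-vs-leverage plots):

```r
# arrange the four diagnostic plots for model2 in a 2x2 grid
par(mfrow = c(2, 2))
plot(model2)
```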




In the series of plots above, in the first plot (residuals vs fitted) we would hope to find a random cloud of points with a straight horizontal red line. Looking at the plot, the curved red line would suggest some heteroscedasticity, but the cloud looks quite random. Similarly, we are looking for a random cloud of points with no apparent patterning or shape in the third plot of standardised residuals vs fitted values. Here, the cloud of points also looks fairly random, with perhaps some shaping indicated by the red line.
10.0.7 Assumption 5 - Independence of Errors
This assumption simply states that the residual values (errors) in your model must not be correlated in any way. If they are, then they exhibit autocorrelation which suggests that something might be going on in the background that we have not sufficiently accounted for in our model.
10.0.7.1 Standard Autocorrelation
If you are running a regression model on data that do not have explicit space or time dimensions, then the standard test for autocorrelation would be the Durbin-Watson test.
This tests whether residuals are correlated and produces a summary statistic that ranges between 0 and 4, with 2 signifying no autocorrelation. A value greater than 2 suggests negative autocorrelation and a value less than 2 indicates positive autocorrelation.
In his excellent textbook - https://www.discoveringstatistics.com/books/discovering-statistics-using-r/ - Andy Field suggests that you should be concerned by Durbin-Watson test statistics <1 or >3. So let's see:
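A sketch using durbinWatsonTest() from the car package on the model fitted earlier:

```r
library(car)

# Durbin-Watson test on the residuals of model2
durbinWatsonTest(model2)
```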
## lag Autocorrelation D-W Statistic p-value
## 1 0.1930199 1.61264 0
## Alternative hypothesis: rho != 0
As you can see, the DW statistic for our model is 1.61, so there is some indication of autocorrelation, but perhaps nothing to worry about.
HOWEVER
We are using spatially referenced data and as such we should check for spatial autocorrelation.
The first test we should carry out is to map the residuals to see if there are any apparent obvious patterns:
#now plot the residuals
tmap_mode("view")
#qtm(LonWardProfiles, fill = "model1_resids")
tm_shape(LonWardProfiles) +
tm_polygons("model2_resids",
palette = "RdYlBu") +
tm_shape(lond_sec_schools_sf) + tm_dots(col = "TYPE")
Looking at the map above, there look to be some blue areas next to other blue areas and some red/orange areas next to other red/orange areas. This suggests that there could well be some spatial autocorrelation biasing our model. But can we test for spatial autocorrelation more systematically?
Yes - and some of you will remember this from the practical two weeks ago. We can calculate a number of different statistics to check for spatial autocorrelation - the most common of these being Moran’s I.
Let’s do that now.
#Firstly convert our sf object into an sp object:
LonWardProfilesSP <- as(LonWardProfiles,"Spatial")
names(LonWardProfilesSP)
## [1] "NAME"
## [2] "GSS_CODE"
## [3] "HECTARES"
## [4] "NONLD_AREA"
## [5] "LB_GSS_CD"
## [6] "BOROUGH"
## [7] "POLY_ID"
## [8] "Ward.name"
## [9] "Old.code"
## [10] "Population...2015"
## [11] "Children.aged.0.15...2015"
## [12] "Working.age..16.64....2015"
## [13] "Older.people.aged.65....2015"
## [14] "X..All.Children.aged.0.15...2015"
## [15] "X..All.Working.age..16.64....2015"
## [16] "X..All.Older.people.aged.65....2015"
## [17] "Mean.Age...2013"
## [18] "Median.Age...2013"
## [19] "Area...Square.Kilometres"
## [20] "Population.density..persons.per.sq.km....2013"
## [21] "X..BAME...2011"
## [22] "X..Not.Born.in.UK...2011"
## [23] "X..English.is.First.Language.of.no.one.in.household...2011"
## [24] "General.Fertility.Rate...2013"
## [25] "Male.life.expectancy..2009.13"
## [26] "Female.life.expectancy..2009.13"
## [27] "X..children.in.reception.year.who.are.obese...2011.12.to.2013.14"
## [28] "X..children.in.year.6.who.are.obese..2011.12.to.2013.14"
## [29] "Rate.of.All.Ambulance.Incidents.per.1.000.population...2014"
## [30] "Rates.of.ambulance.call.outs.for.alcohol.related.illness...2014"
## [31] "Number.Killed.or.Seriously.Injured.on.the.roads...2014"
## [32] "In.employment..16.64....2011"
## [33] "Employment.rate..16.64....2011"
## [34] "Number.of.jobs.in.area...2013"
## [35] "Employment.per.head.of.resident.WA.population...2013"
## [36] "Rate.of.new.registrations.of.migrant.workers...2011.12"
## [37] "Median.House.Price.......2014"
## [38] "Number.of.properties.sold...2014"
## [39] "Median.Household.income.estimate..2012.13."
## [40] "Number.of.Household.spaces...2011"
## [41] "X..detached.houses...2011"
## [42] "X..semi.detached.houses...2011"
## [43] "X..terraced.houses...2011"
## [44] "X..Flat..maisonette.or.apartment...2011"
## [45] "X..Households.Owned...2011"
## [46] "X..Households.Social.Rented...2011"
## [47] "X..Households.Private.Rented...2011"
## [48] "X..dwellings.in.council.tax.bands.A.or.B...2015"
## [49] "X..dwellings.in.council.tax.bands.C..D.or.E...2015"
## [50] "X..dwellings.in.council.tax.bands.F..G.or.H...2015"
## [51] "Claimant.rate.of.key.out.of.work.benefits..working.age.client.group...2014."
## [52] "Claimant.Rate.of.Housing.Benefit..2015."
## [53] "Claimant.Rate.of.Employment.Support.Allowance...2014"
## [54] "Rate.of.JobSeekers.Allowance..JSA..Claimants...2015"
## [55] "X..dependent.children..0.18..in.out.of.work.households...2014"
## [56] "X..of.households.with.no.adults.in.employment.with.dependent.children...2011"
## [57] "X..of.lone.parents.not.in.employment...2011"
## [58] "X.ID2010....Rank.of.average.score..within.London....2010"
## [59] "X.ID2010....of.LSOAs.in.worst.50..nationally...2010"
## [60] "Average.GCSE.capped.point.scores...2014"
## [61] "Unauthorised.Absence.in.All.Schools.......2013"
## [62] "X..with.no.qualifications...2011"
## [63] "X..with.Level.4.qualifications.and.above...2011"
## [64] "A.Level.Average.Point.Score.Per.Student...2013.14"
## [65] "A.Level.Average.Point.Score.Per.Entry..2013.14"
## [66] "Crime.rate...2014.15"
## [67] "Violence.against.the.person.rate...2014.15"
## [68] "Deliberate.Fires.per.1.000.population...2014"
## [69] "X..area.that.is.open.space...2014"
## [70] "Cars.per.household...2011"
## [71] "Average.Public.Transport.Accessibility.score...2014"
## [72] "X..travel.by.bicycle.to.work...2011"
## [73] "Turnout.at.Mayoral.election...2012"
## [74] "model1_resids"
## [75] "model2_resids"
#and calculate the centroids of all Wards in London
coordsW <- coordinates(LonWardProfilesSP)
plot(coordsW)
#Now we need to generate a spatial weights matrix (remember from the lecture a couple of weeks ago). We'll start with a simple binary matrix of queen's case neighbours
#create a neighbours list of queens contiguity
LWard_nb <- poly2nb(LonWardProfilesSP, queen=T)
#or nearest neighbours
knn_wards <- knearneigh(coordsW, k=4)
## Warning in knearneigh(coordsW, k = 4): knearneigh: identical points found
#convert the knn object to a neighbours list
LWard_knn <- knn2nb(knn_wards)

#create a spatial weights matrix object from these weights
Lward.queens_weight <- nb2listw(LWard_nb, style="C")
Lward.knn_4_weight <- nb2listw(LWard_knn, style="C")
#now run a moran's I test on the residuals
#first using queens neighbours
moran.test(LonWardProfilesSP@data$model2_resids, Lward.queens_weight)
##
## Moran I test under randomisation
##
## data: LonWardProfilesSP@data$model2_resids
## weights: Lward.queens_weight
##
## Moran I statistic standard deviate = 11.668, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2678354337 -0.0016000000 0.0005332379
##
#then using the k-nearest neighbours weights
moran.test(LonWardProfilesSP@data$model2_resids, Lward.knn_4_weight)
##
## Moran I test under randomisation
##
## data: LonWardProfilesSP@data$model2_resids
## weights: Lward.knn_4_weight
##
## Moran I statistic standard deviate = 10.938, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2915804046 -0.0016000000 0.0007183984
Observing the Moran’s I statistic for both Queen’s case neighbours and k-nearest neighbours (k = 4), we can see that it lies somewhere between 0.27 and 0.29. Remembering that Moran’s I ranges between -1 and +1 (with 0 indicating no spatial autocorrelation), we can conclude that there is some weak-to-moderate spatial autocorrelation in our residuals.
This means that despite passing most of the assumptions of linear regression, we could have a situation here where the presence of some spatial autocorrelation could be leading to biased estimates of our parameters and significance values.
10.1 Spatial Regression Models
10.1.2 Spatial regression model in R
We will now read in some extra data which we will use shortly
extradata <- read_csv("https://www.dropbox.com/s/qay9q1jwpffxcqj/LondonAdditionalDataFixed.csv?raw=1")
#add the extra data too
LonWardProfiles <- left_join(LonWardProfiles, extradata, by = c("GSS_CODE" = "Wardcode"))
names(LonWardProfiles)
## [1] "NAME"
## [2] "GSS_CODE"
## [3] "HECTARES"
## [4] "NONLD_AREA"
## [5] "LB_GSS_CD"
## [6] "BOROUGH"
## [7] "POLY_ID"
## [8] "Ward name"
## [9] "Old code"
## [10] "Population - 2015"
## [11] "Children aged 0-15 - 2015"
## [12] "Working-age (16-64) - 2015"
## [13] "Older people aged 65+ - 2015"
## [14] "% All Children aged 0-15 - 2015"
## [15] "% All Working-age (16-64) - 2015"
## [16] "% All Older people aged 65+ - 2015"
## [17] "Mean Age - 2013"
## [18] "Median Age - 2013"
## [19] "Area - Square Kilometres"
## [20] "Population density (persons per sq km) - 2013"
## [21] "% BAME - 2011"
## [22] "% Not Born in UK - 2011"
## [23] "% English is First Language of no one in household - 2011"
## [24] "General Fertility Rate - 2013"
## [25] "Male life expectancy -2009-13"
## [26] "Female life expectancy -2009-13"
## [27] "% children in reception year who are obese - 2011/12 to 2013/14"
## [28] "% children in year 6 who are obese- 2011/12 to 2013/14"
## [29] "Rate of All Ambulance Incidents per 1,000 population - 2014"
## [30] "Rates of ambulance call outs for alcohol related illness - 2014"
## [31] "Number Killed or Seriously Injured on the roads - 2014"
## [32] "In employment (16-64) - 2011"
## [33] "Employment rate (16-64) - 2011"
## [34] "Number of jobs in area - 2013"
## [35] "Employment per head of resident WA population - 2013"
## [36] "Rate of new registrations of migrant workers - 2011/12"
## [37] "Median House Price (£) - 2014"
## [38] "Number of properties sold - 2014"
## [39] "Median Household income estimate (2012/13)"
## [40] "Number of Household spaces - 2011"
## [41] "% detached houses - 2011"
## [42] "% semi-detached houses - 2011"
## [43] "% terraced houses - 2011"
## [44] "% Flat, maisonette or apartment - 2011"
## [45] "% Households Owned - 2011"
## [46] "% Households Social Rented - 2011"
## [47] "% Households Private Rented - 2011"
## [48] "% dwellings in council tax bands A or B - 2015"
## [49] "% dwellings in council tax bands C, D or E - 2015"
## [50] "% dwellings in council tax bands F, G or H - 2015"
## [51] "Claimant rate of key out-of-work benefits (working age client group) (2014)"
## [52] "Claimant Rate of Housing Benefit (2015)"
## [53] "Claimant Rate of Employment Support Allowance - 2014"
## [54] "Rate of JobSeekers Allowance (JSA) Claimants - 2015"
## [55] "% dependent children (0-18) in out-of-work households - 2014"
## [56] "% of households with no adults in employment with dependent children - 2011"
## [57] "% of lone parents not in employment - 2011"
## [58] "(ID2010) - Rank of average score (within London) - 2010"
## [59] "(ID2010) % of LSOAs in worst 50% nationally - 2010"
## [60] "Average GCSE capped point scores - 2014"
## [61] "Unauthorised Absence in All Schools (%) - 2013"
## [62] "% with no qualifications - 2011"
## [63] "% with Level 4 qualifications and above - 2011"
## [64] "A-Level Average Point Score Per Student - 2013/14"
## [65] "A-Level Average Point Score Per Entry; 2013/14"
## [66] "Crime rate - 2014/15"
## [67] "Violence against the person rate - 2014/15"
## [68] "Deliberate Fires per 1,000 population - 2014"
## [69] "% area that is open space - 2014"
## [70] "Cars per household - 2011"
## [71] "Average Public Transport Accessibility score - 2014"
## [72] "% travel by bicycle to work - 2011"
## [73] "Turnout at Mayoral election - 2012"
## [74] "model1_resids"
## [75] "model2_resids"
## [76] "WardName"
## [77] "WardCode"
## [78] "PctSharedOwnership2011"
## [79] "PctRentFree2011"
## [80] "Candidate"
## [81] "InnerOuter"
## [82] "x"
## [83] "y"
## [84] "AvgGCSE2011"
## [85] "UnauthAbsenceSchools11"
## [86] "geometry"
10.1.3 Extending your regression model - Dummy Variables
What if, instead of fitting one line to our cloud of points, we could fit several, depending on whether the Wards we were analysing fell into one group or another? What if the relationship between attending school and achieving good exam results varied between Inner and Outer London, for example? Could we test for that? Yes we can - quite easily, in fact.
If we colour the points representing Wards for Inner and Outer London differently, we can start to see that there might be something interesting going on. Using 2011 data (as there are not the rounding errors that there are in the more recent data), there seems to be a stronger relationship between absence and GCSE scores in Outer London than Inner London. We can test for this in a standard linear regression model.
p <- ggplot(LonWardProfiles, aes(x=`UnauthAbsenceSchools11`, y=`AvgGCSE2011`))
p + geom_point(aes(colour = InnerOuter)) 
Dummy variables are always categorical data (inner or outer London, or red / blue etc.). When we incorporate them into a regression model, they serve the purpose of splitting our analysis into groups. In the graph above, it would mean, effectively, having a separate regression line for the red points and a separate line for the blue points.
Let’s try it!
#first, let's make sure R is reading our InnerOuter variable as a factor
LonWardProfiles$InnerOuter <- as.factor(LonWardProfiles$InnerOuter)
#now run the model
model3 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
log(`Median House Price (£) - 2014`) +
`InnerOuter`, data = LonWardProfiles)
summary(model3)
##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter, data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.126 -9.874 -0.156 8.643 49.770
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 97.587 24.058 4.056
## `Unauthorised Absence in All Schools (%) - 2013` -30.097 2.028 -14.842
## log(`Median House Price (£) - 2014`) 19.834 1.742 11.384
## InnerOuterOuter 10.932 1.509 7.243
## Pr(>|t|)
## (Intercept) 5.62e-05 ***
## `Unauthorised Absence in All Schools (%) - 2013` < 2e-16 ***
## log(`Median House Price (£) - 2014`) < 2e-16 ***
## InnerOuterOuter 1.30e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.93 on 622 degrees of freedom
## Multiple R-squared: 0.5235, Adjusted R-squared: 0.5212
## F-statistic: 227.8 on 3 and 622 DF, p-value: < 2.2e-16
So how can we interpret this?
Well, the dummy variable is statistically significant and the coefficient tells us the difference between the two groups (Inner and Outer London) we are comparing. In this case, it is telling us that living in a Ward in outer London will improve your average GCSE score by 10.93 points, on average, compared to if you lived in Inner London. The R-squared has increased slightly, but not by much.
You will notice that despite there being two values in our dummy variable (Inner and Outer), we only get one coefficient. This is because with dummy variables, one value is always considered to be the control (comparison/reference) group. In this case we are comparing Outer London to Inner London. If our dummy variable had more than 2 levels we would have more coefficients, but always one as the reference.
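In effect, the dummy gives each group its own intercept while the slope on each continuous variable stays shared. A quick sketch of the two fitted lines implied by the coefficients in the summary above (the value chosen for log house price here is purely illustrative):

```r
#coefficients taken from the model3 summary above
b0 <- 97.587; b_abs <- -30.097; b_hp <- 19.834; b_outer <- 10.932
m <- 13 #an assumed, illustrative value of log(median house price)

#the two fitted lines implied by the dummy: same slope, shifted intercept
inner_gcse <- function(absence) b0 + b_abs*absence + b_hp*m           #Inner London (reference)
outer_gcse <- function(absence) b0 + b_abs*absence + b_hp*m + b_outer #Outer London

#the vertical gap between the lines is just the dummy coefficient, 10.932
outer_gcse(1) - inner_gcse(1)
```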
The order in which the dummy comparisons are made is determined by what is known as a ‘contrast matrix’. This determines the treatment group (1) and the control (reference) group (0). We can view the contrast matrix using the contrasts() function:
## Outer
## Inner 0
## Outer 1
If we want to change the reference group, there are various ways of doing this. We can use the contrasts() function, or we can use the relevel() function. Let’s try it:
LonWardProfiles$InnerOuter <- relevel(LonWardProfiles$InnerOuter, ref="Outer")
model3 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
log(`Median House Price (£) - 2014`) +
`InnerOuter`, data = LonWardProfiles)
summary(model3)
##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter, data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.126 -9.874 -0.156 8.643 49.770
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 108.518 23.189 4.680
## `Unauthorised Absence in All Schools (%) - 2013` -30.097 2.028 -14.842
## log(`Median House Price (£) - 2014`) 19.834 1.742 11.384
## InnerOuterInner -10.932 1.509 -7.243
## Pr(>|t|)
## (Intercept) 3.53e-06 ***
## `Unauthorised Absence in All Schools (%) - 2013` < 2e-16 ***
## log(`Median House Price (£) - 2014`) < 2e-16 ***
## InnerOuterInner 1.30e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.93 on 622 degrees of freedom
## Multiple R-squared: 0.5235, Adjusted R-squared: 0.5212
## F-statistic: 227.8 on 3 and 622 DF, p-value: < 2.2e-16
You will notice that the only thing that has changed in the model is that the coefficient for the InnerOuter variable now relates to Inner London and is now negative (meaning that living in Inner London is likely to reduce your average GCSE score by 10.93 points compared to Outer London). The rest of the model is exactly the same.
10.1.4 TASK: Investigating Further - Adding More Explanatory Variables into a multiple regression model
Now it’s your turn. You have been shown how you could begin to model average GCSE scores across London, but the models we have built so far have been fairly simple in terms of explanatory variables.
You should try and build the optimum model of GCSE performance from your data in your LondonWards dataset. Experiment with adding different variables - when building a regression model in this way, you are trying to hit a sweet spot between increasing your R-squared value as much as possible, but with as few explanatory variables as possible.
10.1.4.1 A few things to watch out for…
You should never just throw variables at a model without a good theoretical reason for why they might have an influence. Choose your variables carefully!
Be prepared to take variables out of your model either if a new variable confounds (becomes more important than) earlier variables or turns out not to be significant.
For example, let’s try adding the rate of drugs-related crime (logged, as it is a positively skewed variable whose log is approximately normally distributed) and the number of cars per household… are these variables significant? What happens to the spatial errors in your models?
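A sketch of what that model might look like. Note that the drugs-crime column name below is hypothetical (check names(LonWardProfiles) for the real one), while the cars-per-household column does appear in the dataset:

```r
#the drugs-crime column name here is a placeholder - substitute the real one
model4 <- lm(`Average GCSE capped point scores - 2014` ~
               `Unauthorised Absence in All Schools (%) - 2013` +
               log(`Median House Price (£) - 2014`) +
               InnerOuter +
               log(`Rate of drugs related crime - 2014`) + #hypothetical column name
               `Cars per household - 2011`,
             data = LonWardProfiles)
summary(model4)
```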
10.2 Task 3 - Spatial Non-stationarity and Geographically Weighted Regression Models (GWR)
Here’s my final model from the last section:
#select some variables from the data file
myvars <- c("Average GCSE capped point scores - 2014","Unauthorised Absence in All Schools (%) - 2013","Median House Price (£) - 2014","Rate of JobSeekers Allowance (JSA) Claimants - 2015", "% with Level 4 qualifications and above - 2011")
#check their correlations are OK (dropping the geometry column first)
cormat <- cor(st_drop_geometry(LonWardProfiles)[myvars], use="complete.obs", method="pearson")
#run a final OLS model
model_final <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
log(`Median House Price (£) - 2014`) + `InnerOuter` + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` + `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles)
summary(model_final)
##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` +
## `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.516 -9.083 0.164 9.163 45.572
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 239.93832 27.93079
## `Unauthorised Absence in All Schools (%) - 2013` -23.61673 2.16122
## log(`Median House Price (£) - 2014`) 8.41359 2.26238
## InnerOuterInner -10.36905 1.64637
## `Rate of JobSeekers Allowance (JSA) Claimants - 2015` -2.81347 0.63516
## `% with Level 4 qualifications and above - 2011` 0.41275 0.07836
## t value Pr(>|t|)
## (Intercept) 8.590 < 2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013` -10.927 < 2e-16 ***
## log(`Median House Price (£) - 2014`) 3.719 0.000218 ***
## InnerOuterInner -6.298 5.71e-10 ***
## `Rate of JobSeekers Allowance (JSA) Claimants - 2015` -4.430 1.12e-05 ***
## `% with Level 4 qualifications and above - 2011` 5.268 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.24 on 620 degrees of freedom
## Multiple R-squared: 0.568, Adjusted R-squared: 0.5645
## F-statistic: 163 on 5 and 620 DF, p-value: < 2.2e-16




#save the residuals from the final model and re-test for spatial autocorrelation
LonWardProfiles$model_final_res <- model_final$residuals
LonWardProfilesSP <- as(LonWardProfiles,"Spatial")
moran.test(LonWardProfilesSP@data$model_final_res, Lward.knn_4_weight)
##
## Moran I test under randomisation
##
## data: LonWardProfilesSP@data$model_final_res
## weights: Lward.knn_4_weight
##
## Moran I statistic standard deviate = 8.4186, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2240092085 -0.0016000000 0.0007181894
Now, we could probably stop at running a spatial error model at this point, but it could be that, rather than spatial autocorrelation causing problems with our model, a “global” regression model simply does not capture the full story. In some parts of our study area, the relationships between the dependent and independent variables may not exhibit the same slope coefficient. While, for example, increases in unauthorised absence are usually negatively correlated with GCSE score (students missing school results in lower exam scores), in some parts of the city they could be positively correlated (in affluent areas, rich parents may enrol their children for just part of the year and live elsewhere in the world for the rest, leading to inflated unauthorised absence figures - ski holidays are cheaper during the school term, and the pupils will still have all of the other advantages of living in a well-off household that benefit their exam scores).
If this occurs, then we have ‘non-stationarity’ - this is when the global model does not represent the relationships between variables that might vary locally.
This part of the practical will only skirt the edges of GWR, for much more detail you should visit the GWR website which is produced and maintained by Prof Chris Brunsdon and Dr Martin Charlton who originally developed the technique - http://gwr.nuim.ie/
There are various packages which will carry out GWR in R; for this practical we will use spgwr (mainly because it was the first one I came across), although you could also use GWmodel or gwrr. At the end of this practical, you can test out these ideas in ArcGIS using the GWR toolset in the Spatial Statistics Toolbox - http://resources.arcgis.com/en/help/main/10.1/index.html#/Geographically_Weighted_Regression_GWR/005p00000021000000/
I should also acknowledge the guide on GWR (accessible here: http://www.bris.ac.uk/cmpo/events/2009/segregation/gwr.pdf) produced by the University of Bristol, which was a great help when producing this exercise.
library(spgwr)
#calculate kernel bandwidth
GWRbandwidth <- gwr.sel(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter` + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` + `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles, coords=coordsW, adapt=T)
## Adaptive q: 0.381966 CV score: 124832.2
## Adaptive q: 0.618034 CV score: 126396.1
## Adaptive q: 0.236068 CV score: 122752.4
## Adaptive q: 0.145898 CV score: 119960.5
## Adaptive q: 0.09016994 CV score: 116484.6
## Adaptive q: 0.05572809 CV score: 112628.7
## Adaptive q: 0.03444185 CV score: 109427.7
## Adaptive q: 0.02128624 CV score: 107562.9
## Adaptive q: 0.01315562 CV score: 108373.2
## Adaptive q: 0.02161461 CV score: 107576.6
## Adaptive q: 0.0202037 CV score: 107505.1
## Adaptive q: 0.01751157 CV score: 107333
## Adaptive q: 0.01584775 CV score: 107175.5
## Adaptive q: 0.01481944 CV score: 107564.8
## Adaptive q: 0.01648327 CV score: 107187.9
## Adaptive q: 0.01603246 CV score: 107143.9
## Adaptive q: 0.01614248 CV score: 107153.1
## Adaptive q: 0.01607315 CV score: 107147.2
## Adaptive q: 0.01596191 CV score: 107143
## Adaptive q: 0.01592122 CV score: 107154.4
## Adaptive q: 0.01596191 CV score: 107143
#run the gwr model
gwr.model = gwr(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter` + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` + `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles, coords=coordsW, adapt=GWRbandwidth, hatmatrix=TRUE, se.fit=TRUE)
#print the results of the model
gwr.model
## Call:
## gwr(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` +
## `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles,
## coords = coordsW, adapt = GWRbandwidth, hatmatrix = TRUE,
## se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.01596191 (about 9 of 626 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu.
## X.Intercept. -345.016491 105.265894
## X.Unauthorised.Absence.in.All.Schools.......2013. -47.061203 -30.804616
## log..Median.House.Price.......2014.. -23.968275 -0.027145
## InnerOuterInner -25.997740 -10.133154
## X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015. -18.958945 -5.511328
## X...with.Level.4.qualifications.and.above...2011. -1.175642 0.055792
## Median 3rd Qu.
## X.Intercept. 206.279057 337.189642
## X.Unauthorised.Absence.in.All.Schools.......2013. -22.971543 -14.344216
## log..Median.House.Price.......2014.. 10.924285 18.809942
## InnerOuterInner -5.871914 -1.756184
## X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015. -3.278698 -1.114797
## X...with.Level.4.qualifications.and.above...2011. 0.509708 0.832212
## Max. Global
## X.Intercept. 699.850993 239.9383
## X.Unauthorised.Absence.in.All.Schools.......2013. 6.798700 -23.6167
## log..Median.House.Price.......2014.. 44.788735 8.4136
## InnerOuterInner 11.694249 -10.3690
## X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015. 20.033764 -2.8135
## X...with.Level.4.qualifications.and.above...2011. 3.042313 0.4127
## Number of data points: 626
## Effective number of parameters (residual: 2traceS - traceS'S): 160.9269
## Effective degrees of freedom (residual: 2traceS - traceS'S): 465.0731
## Sigma (residual: 2traceS - traceS'S): 12.35905
## Effective number of parameters (model: traceS): 116.0071
## Effective degrees of freedom (model: traceS): 509.9929
## Sigma (model: traceS): 11.80222
## Sigma (ML): 10.65267
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 5026.882
## AIC (GWR p. 96, eq. 4.22): 4854.513
## Residual sum of squares: 71038.1
## Quasi-global R2: 0.7557128
The output from the GWR model reveals how the coefficients vary across the 626 Wards in our London study region. You will see that the global coefficients are exactly the same as the coefficients in the earlier lm model. In this particular model (yours will look a little different if you have used different explanatory variables), if we take unauthorised absence from school, we can see that the coefficients range from a minimum of -47.06 (a 1-unit increase in unauthorised absence resulting in a drop in average GCSE score of 47.06 points) to a maximum of +6.8 (a 1-unit increase resulting in a rise in average GCSE score of 6.8 points). For half of the wards in the dataset, as unauthorised absence rises by 1 point, GCSE scores will decrease by between 14.34 and 30.80 points (the inter-quartile range between the 1st and 3rd quartiles).
You will notice that the R-squared value has improved - this is not uncommon for GWR models, but it does not necessarily mean they are definitely better than global models. The small number of cases under each kernel means that GW models have been criticised for lacking statistical robustness.
Coefficient ranges can also be seen for the other variables and they suggest some interesting spatial patterning. To explore this we can plot the GWR coefficients for different variables. Firstly we can attach the coefficients to our original dataframe - this can be achieved simply as the coefficients for each ward appear in the same order in our spatial points dataframe as they do in the original dataframe.
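A sketch of that step - in spgwr the local coefficients (and their standard errors) live in the SDF slot of the model object, which can be converted to a plain dataframe, producing the column names listed below:

```r
#extract the local coefficient estimates from the gwr output into a dataframe
results <- as.data.frame(gwr.model$SDF)
names(results)
```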
## [1] "sum.w"
## [2] "X.Intercept."
## [3] "X.Unauthorised.Absence.in.All.Schools.......2013."
## [4] "log..Median.House.Price.......2014.."
## [5] "InnerOuterInner"
## [6] "X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015."
## [7] "X...with.Level.4.qualifications.and.above...2011."
## [8] "X.Intercept._se"
## [9] "X.Unauthorised.Absence.in.All.Schools.......2013._se"
## [10] "log..Median.House.Price.......2014.._se"
## [11] "InnerOuterInner_se"
## [12] "X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015._se"
## [13] "X...with.Level.4.qualifications.and.above...2011._se"
## [14] "gwr.e"
## [15] "pred"
## [16] "pred.se"
## [17] "localR2"
## [18] "X.Intercept._se_EDF"
## [19] "X.Unauthorised.Absence.in.All.Schools.......2013._se_EDF"
## [20] "log..Median.House.Price.......2014.._se_EDF"
## [21] "InnerOuterInner_se_EDF"
## [22] "X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015._se_EDF"
## [23] "X...with.Level.4.qualifications.and.above...2011._se_EDF"
## [24] "pred.se.1"
## [25] "coord.x"
## [26] "coord.y"
#attach coefficients to original dataframe
LonWardProfilesSP@data$coefUnauthAbs<-results$X.Unauthorised.Absence.in.All.Schools.......2013.
LonWardProfilesSP@data$coefHousePrice<-results$log..Median.House.Price.......2014..
LonWardProfilesSP@data$coefJSA<-results$X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015.
LonWardProfilesSP@data$coefLev4Qual<-results$X...with.Level.4.qualifications.and.above...2011.
Now how would you plot the house price, Jobseekers Allowance and Level 4 qualification coefficients?
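A map of the unauthorised absence coefficients just attached can be sketched with tmap, using a diverging palette centred on zero so positive and negative local relationships stand out:

```r
library(tmap)
#map the local coefficients for unauthorised absence
tm_shape(LonWardProfilesSP) +
  tm_polygons("coefUnauthAbs",
              palette = "RdBu",
              midpoint = 0)
```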
Taking the first plot, which is for the unauthorised absence coefficients, we can see that for the majority of boroughs in London there is the negative relationship we would expect - i.e. as unauthorised absence goes up, exam scores go down. For three boroughs (Westminster, Kensington & Chelsea and Hammersmith & Fulham), as well as an area near Bexleyheath in South East London - some of the richest in London - the relationship is positive: as unauthorised school absence increases, so does average GCSE score. This is a very interesting and counterintuitive pattern, but it may partly be explained by the multiple homes owned by many living in these boroughs (students living in different parts of the country, and indeed the world, for significant periods of the year), foreign holidays and the over-representation of private schooling among those living in these areas. Alternatively, if unauthorised absence from school reflects the absence of poorer students attending local, inner-city schools, then high GCSE grades may also reflect the achievements of those who are sent away to expensive fee-paying schools elsewhere in the country and who return to their parental homes later in the year. Either way, these factors could explain these results.
Of course, these results may not be statistically significant across the whole of London. Roughly speaking, if a coefficient estimate is more than 2 standard errors away from zero, then it is “statistically significant”.
To calculate standard errors, for each variable you can use a formula similar to this:
#**NOTE** when you run the code below, it's likely that the column names read in by read_csv (which contain spaces and lots of punctuation) will cause errors when you try to reference them in gwr.model. To get around these errors, rename your columns as something simple and run the regression again.
#run the significance test
sigTest = abs(gwr.model$SDF$"log(`Median House Price (£) - 2014`)") -2 * gwr.model$SDF$"log(`Median House Price (£) - 2014`)_se"
#store significance results
LonWardProfilesSP$GWRUnauthSig<-sigTest
If this is greater than zero (i.e. the estimate is more than two standard errors away from zero), it is very unlikely that the true value is zero, i.e. it is statistically significant (at nearly the 95% confidence level).
You should now calculate these for each variable in your GWR model and see if you can plot them on a map, for example:
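A sketch of such a map, assuming the GWRUnauthSig column created in the previous chunk:

```r
library(tmap)
#map where the coefficient estimate is more than 2 standard errors from zero
#(values > 0 indicate significance at roughly the 95% level)
tm_shape(LonWardProfilesSP) +
  tm_polygons("GWRUnauthSig",
              palette = "RdYlBu")
```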
From the results of your GWR exercise, what are you able to conclude about the geographical variation in your explanatory variables when predicting your dependent variable?
10.3 ArcGIS
As mentioned at the beginning of this exercise, GWR can also be carried out using the Spatial Statistics Toolbox in ArcGIS. Using the guide in the ArcGIS help system below, try and repeat your GWR analysis using the tools available in Arc - http://resources.arcgis.com/en/help/main/10.1/index.html#/Geographically_Weighted_Regression_GWR/005p00000021000000/